As an important indicator for comprehensive national strength, educational level information is collected at each Censuses of Population. Educational level is not only the embodiment of the advanced nature, but also an important influencing factor of social stability. The temporal changes and spatial distribution of educational level are useful reference for investment, school district selection, and employment. However, it is not easy for a citizen to dig up this information and have a visual perception. Even the government website might provide raw dataset of native historical education level. Citizen without basic statistical training can only get limited information from raw data that government’s website provided. Besides, most of the raw data is huge and messy, it is nearly impossible for an untrained citizen to get information from. In this study, we aim to provide a convenient visualization tool to help people obtain and analysis the educational level for the US. The raw data is from the Census Bureau of US. Data technologies are needed to clean the data into a long tidy version, and check the completeness of the data.
The main deliverable products of this study will be an R package called “Edulevel”, and a corresponding shiny app for people who are not familiar of using R package. The package and the shiny app contains 25 years of education level of every county in the US. First, the package has basic function of searching a database, which could gather information we want. For example, the package and the app can provide the educational level of the specific time and county, if the county name and year is specified. What’s more, the package can provide more detailed information after we define the requirement. The second function of this product is to provide simple graphical tools to analyze the educational level spatially from year to year. This package can help the users to identify the temporal changes and spatial distribution of educational level for an area of interest. Besides, a user could not only have plot comparing the educational level along with year of a specific location, but also have an intuitive map with gradient color which indicate the different educational level. The map could be a US map which has distinguished colored state, or a state map with gradient color indicate different counties.
The data is collected from the website of US government open source database (https://www.data.gov/). The dataset is an excel file that contains 3283 rows which indicate 3283 counties in US and 47 columns which are variables. Table 2.1 shows a part of the raw data. In the variables, seven of them are attributes of counties, 40 of them are educational level variables. The seven attributes are FIPS code, State, Area name (county), 2003 Rural-urban Continuum Code and so on. The FIPS code is unique for each county, so it could be kept helping users to select county in R. The “State” and “Area name” is kept helping users to select the specific area they are interested in. Since other four variables are not unique for each county and also not well known by users, they will not be used in the package or shiny app. The 40 variables contain four educational level which is “Less than a high school diploma”, “High school diploma only”, “Some college (1-3 years)”, “Four years of college or higher” and percentage of adults of these four levels. There are five data point for every educational level from 1970 to 2015 by every ten or five years. The 40 educational level variables will all be used and they will be classified as two types: count and percentage. Value of each variable means in the selected county at a specific time point, the number of people or the percentage of adults in a specific education level.
| FIPS Code | State | Area name | Less than a high school diploma, 1970 | High school diploma only, 1970 | Some college (1-3 years), 1970 | Four years of college or higher, 1970 | Percent of adults with less than a high school diploma, 1970 | Percent of adults with a high school diploma only, 1970 | Percent of adults completing some college (1-3 years), 1970 | Percent of adults completing four years of college or higher, 1970 |
|---|---|---|---|---|---|---|---|---|---|---|
| 00000 | US | United States | 52373312 | 34158051 | 11650730 | 11717266 | 47.7 | 31.1 | 10.6 | 10.7 |
| 01000 | AL | Alabama | 1062306 | 468269 | 136287 | 141936 | 58.7 | 25.9 | 7.5 | 7.8 |
| 01001 | AL | Autauga County | 6611 | 3757 | 933 | 767 | 54.8 | 31.1 | 7.7 | 6.4 |
| 01003 | AL | Baldwin County | 18726 | 8426 | 2334 | 2038 | 59.4 | 26.7 | 7.4 | 6.5 |
| 01005 | AL | Barbour County | 8120 | 2242 | 581 | 861 | 68.8 | 19.0 | 4.9 | 7.3 |
| 01007 | AL | Bibb County | 5272 | 1402 | 238 | 302 | 73.1 | 19.4 | 3.3 | 4.2 |
| 01009 | AL | Blount County | 10677 | 3440 | 626 | 404 | 70.5 | 22.7 | 4.1 | 2.7 |
The raw data might have problems for R to select information from even it looks like clean. There might be missing data and wrong data points. The hybrid column name is combination of educational level and year which is not straightforward for R to select either educational level or time point. The word description for the same educational level is not exactly the same. Besides, the format of the dataset is not convenient for users. So, several steps are needed to make the dataset tidy before the application stage.
First step is to revise the dataset into a tidy format. R Function “gather” is used to gather columns 8 which is first column of educational level variables to the last column into new column “values”, and previous column name which indicate the educational variables into new column “education”. Other seven attributes columns do not change. We also use R function “separate” to separate cells under Column “education” into Column “level” and “year”. For example, first row of Column “education” is “Less than a high school diploma, 1970”; it is separated into two columns: Column “level” which has value of “Less than a high school diploma” and Column “year” which has value of “1970”. In the same time, four columns of not useful attributes are deleted using R Function “select”. Then, we add a Column “type” with R Function “mutate” to distinguish if the value is a count of people or percentage of adults with an educational level. Thus, we already have an initial format of dataset. However, there are still several simple changes needed. R Function “parse_number” is used to drop the non-numeric characters before or after the first number in Column “FIPS”. R Function “trimws” is used to remove the whitespace in Column “year”.
To have a general knowledge of the dataset. We use R Function “table” to look at the educational level. It turns out the names of the same educational level are not consistent. R Function “group_by” and “tally” could make the same point clearer. For example, in 1970 and 1980 year, the highest educational level is “Four years of college or higher”, the highest educational level in 2000 is “Bachelor’s degree or higher”. These two descriptions should be classified as the same educational level, however. To matching the names of the same educational level, we use R Function “gsub”. Function “gsub” is used to select key word and replace the description of variables in Column “level” with the same appropriate names. After several attempts to try to describe all the levels with regular expressions, the educational levels are combined as four levels: “Bachelor’s degree or higher”, “Some college (1-3 years)”, “High school diploma only”, “Less than a high school diploma”.
On the other hand, the correctness and completeness of the data is needed to be checked. R Function “spread” is used to spread the “value” by “type”, which means spread the value of type “count” and “percentage” into two columns. For one specific FIPS code (county) and a year, the percentage for each level is the corresponding count divided by the total count. To check if the percentage variable is calculated correctly, the R Function “mutate” is used to add a Column “new_percentage” which is calculated from variable “count”. Check if the variable “new_percentage” is the same with the variable “percentage” and select these rows that are not the same with function “filter”. It turns out there are seven rows with percentage calculated incorrectly. Use R Function “mutate” to correct the variable “percentage”. Last, R Function “gather” is used to get back to tidy version of dataset (as shown in Table 2.2). Until this step, the dataset is tidy and correct to conduct further analyze.
| FIPS | State | Area | level | year | type | value |
|---|---|---|---|---|---|---|
| 0 | US | United States | Less than a high school diploma | 1970 | count | 52373312 |
| 1000 | AL | Alabama | Less than a high school diploma | 1970 | count | 1062306 |
| 1001 | AL | Autauga County | Less than a high school diploma | 1970 | count | 6611 |
| 1003 | AL | Baldwin County | Less than a high school diploma | 1970 | count | 18726 |
| 1005 | AL | Barbour County | Less than a high school diploma | 1970 | count | 8120 |
| 1007 | AL | Bibb County | Less than a high school diploma | 1970 | count | 5272 |
The main deliverable product of this project is an R package and a shiny app. The package is used for people who have experience on R and grasp the skills to apply simple R functions. The package called “Edulevel”, which contains one dataset and four functions. The dataset is the tidy data names “tidy” from Section 2 (showed in Table 2.2 as an example) which contains information of four educational level for adults (more than 25 years old) in US from 1970 to 2015. The dataset is stored using “LazyData”. Function “showcounty” will return a list of county names for a given state abbreviation, which will be used in other functions, and it also provide reference for users when using other functions. Function “lineEd” will plot a line chart (“count” or “percent”) of all (four levels) education levels for a given county. Function “mapstate” will give a map of the “count” or “percent” data on the scale of states in US for a given education level. Function “mapcounty” will give a map of the “count” or “percent” data on the scale of counties for a given state and a given education level. The plot function is this package all work well with “ggplotly”.
showcountyWhen using function “showcounty”, the user need to specify the two-letter abbreviated state name that is interested in. Since the state name is character, it has to be quoted in double quotes. For example, if we are interested in Iowa, the command is shown below. The output is a list of county names of Iowa state. The code below shows first 20 county names.
library(Edulevel)
library(tidyverse)
#show list of counties in Iowa state
showcounty("IA") [1:20]
## [1] "Adair County" "Adams County" "Allamakee County"
## [4] "Appanoose County" "Audubon County" "Benton County"
## [7] "Black Hawk County" "Boone County" "Bremer County"
## [10] "Buchanan County" "Buena Vista County" "Butler County"
## [13] "Calhoun County" "Carroll County" "Cass County"
## [16] "Cedar County" "Cerro Gordo County" "Cherokee County"
## [19] "Chickasaw County" "Clarke County"
lineEdFunction “lineEd” is used when one is interested in one specific county and want to analyze the trend educational level in the past years. The input of the function is the abbreviated state name and county name. The state name and county name should be quoted in double quotes. The county name has to be exactly the same with the county name in the dataset “tidy”. If the user is not clear with the spell of county name, function “showcounty” could help. The output of the function is a plot of four educational level of the specific county. The data is group by four educational level with four different color and facet by value type (count or percent). For example, to find the detailed educational level in past years of Story county in Iowa, we could use following command. In this output, the left plot is count (y-axis) of an educational level in Story county of the five time points from year 1970 to 2015 (x-axis). The right plot is percent of adults who have an educational level in Story county in the five time points from year 1970 to 2015. The different color means different educational level, e.g. the red line shows the amount of people who have Bachelor’s degree or higher of Story count.
#Plot educational level of Story county in Iowa
lineEd("IA", "Story County")
Please notice that if the input of “countyname” is “Story” instead of “Story County”, R will give an error message. We could see from the above output plot that the people with Bachelor’s degree or higher (red line) is increasing rapidly over the years. The percentage of adults with a Bachelor’s degree or higher is increased by about 20% percentage in 35 years. The people who has been to college is also increasing. Meanwhile, the people who has high school diploma or lower level is decreasing. The figure tells us the education level in Story county is stepping into a good direction. It seems that without considering the education progress in US, Story county is a pretty good place to stay if parents are emphasis on children’s education.
Using the function “lineEd” we can also detect differences from county to county. Actually, the education level of Story county is increasing rapidly after considering the education in other counties in US. Figure below shows the educational level of Liberty county in Florida. The people with high school diploma is increasing rapidly, meanwhile the people with Bachelor’s degree is increasing pretty slowly. Also, there are other counties that people with high education level is decreasing and population that only been to high school increasing, such as Yukon-Koyukuk Census area in Alaska. From the plot of story county, we can see for both count and percentage, there is an obvious increase for bachelor’s degree level and some college level from 1970 to 2015, while for counties like Liberty county in Florida or Yukon-Koyukuk Census area in Alaska, the percentage of people who have a bachelor’s degree or higher didn’t change too much along the time.
# Plot educational level of Liberty county in Florida
lineEd("FL", "Liberty County")
mapstateFunction “mapstate” provide a summary map of education status of US. The input parameters are “type” and “level”. Parameter “type” is either “count” or “percent”. There are four levels, with numeric number 1 to 4 to represent four educational levels from “Less than a high school diploma”, “High school diploma only”, “Some college (1-3 years)”, “Bachelor’s degree or higher”. For example, the following command is asking for the layout of percentage of adults with Bachelor’s degree in US. The output is a series of graphics which are plots from five time points of percent of adults with Bachelor’s degree or higher. Each of the US map plot is build up by unit of state. Deep color means high percent value, and light color means low percent value. For example, the code below is asking for percent of adults who have Bachelor’s degree or higher.
From this plot, it is clear that the percentage of people who have a bachelor’s degree or higher went up from 1970 to 2015. With label of “ggplotly”, the state with highest percentage is Colorado. The percent of adults with Bachelor’s degree or higher in Colorado is up to 38.1% in 2015.
#General view of percent of adults who have Bachelor’s degree or higher.
library(plotly)
ggplotly(mapstate("percent", 4))
The code below shows the US map of people with less than a high school diploma. It is clear that from the type of “percent” map, the trend of percent of people with this educational level is decreasing. However, from the map with “count” as type, it is not obvious. The reason is that the population of US is increasing, while the percent of this educational level is decreasing. It turns out the absolute count has no obvious change. So from this kind of view, choose “percent” as type is better than “count”.
#General view of count of people who have less than a high school diploma
mapstate("count", 1)
#General view of percent of people who have less than a high school diploma
mapstate("percent", 1)
mapcountyFunction “mapcounty” has similar output with “mapstate”, except that the map is one specific state rather than US and units are county rather than state. The input parameters are “type”, “level”, “stateshort” and “limit”. Parameters “type” and “level” have the same with the usage in function “mapstate’. Parameter “stateshort” is the two-letter abbreviated state name. Parameter “limit” means, under an educational level, if a county who has an average percentage of people greater than “limit”, the county name will be labeled; default “limit” is 28. For example, if we use following command to look for people with Bachelor’s degree in Iowa, figure below is the output. The labeled counties are adult percentage higher than 28%. In this plot, we can see for bachelor degree or higher level, there are increasing number of counties whose percentage are larger than thirty. In 1970-1990, only the percent of people with Bachelor’s degree in Story county and Johnson county are greater than 28%.
#People with Bachelor’s degree in Iowa
mapcounty("percent",levelint = 4,stateshort="IA",limit=28)
For users who are not familiar with R, shiny app is a friendly tool. Also, the shiny app could provide more intuitive user interface. There are two navigation tab in shiny app which are “Plot” and “Data”. The Plot tab contains plots on country level (“mapstate”), state level (“mapcounty”) and county level (“lineEd”). The three plots are in three panel. There are four widgets in total for user to control the variables on the Plot tab. The first one is a radio button with choices of “count” or “percent”. The second one is a select input button with choices of all state names, the third one is also a select box with choices of county names for a chosen state. The last one is a radio button of four education levels. The other navigation tab Data contains all tidy data of education level in the US. There are a few widgets for users to filter the data of interest. This table has similar function as the dataset in R. The user could select the state, county, year, education level and value type (count or percent) using select box.
Please use link below to open shiny app: https://github.com/Xiangmei21/585-project/blob/master/shiny%20app/shinyapp_final_version.R
From this study, we apply the data technologies to conduct data clean and re-organization, draw analytic graphics, apply data onto maps, and make user interactive R package and shiny app. We have a more complete and deeper understanding of the technique we learned in class after this study. We also be more skilled in using R and R packages.
The R package “Edulevel” is more suitable for users who have experience in R. For the users who are skilled at R, they could change the source code, or do more complex analysis using raw data. For users who are not familiar with R, the shiny app is a better way for visualization, and it is easier and more interesting to use shiny app. The R package “Edulevel” can truly help people to find a good location with better education environment. What’s more, it could also show the development of technology in US and advancement of every state in education aspect.
The Github repository path is: https://github.com/Xiangmei21/585-project
The Package path is: https://github.com/Xiangmei21/585-project/tree/master/Edulevel
The Shiny app path is: https://github.com/Xiangmei21/585-project/blob/master/shiny%20app/shinyapp_final_version.R
The code to clean data path is: https://github.com/Xiangmei21/585-project/blob/master/585%20project.R
The raw data path is: https://github.com/Xiangmei21/585-project/blob/master/Education.xls